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Abstract 

For nearly any challenging scientific problem evaluation of the likelihood is problematic if not 
impossible. Approximate Bayesian computation (ABC) allows us to employ the whole Bayesian 
formalism to problems where we can use simulations from a model, but cannot evaluate the likeli- 
hood directly. When summary statistics of real and simulated data are compared — rather than 
the data directly — information is lost, unless the summary statistics are sufficient. Here we em- 
ploy an information-theoretical framework that can be used to construct (approximately) sufficient 
statistics by combining different statistics until the loss of information is minimized. Such sufficient 
sets of statistics are constructed for both parameter estimation and model selection problems. We 
apply our approach to a range of illustrative and real-world model selection problems. 

* All authors contributed equally. 

^ To Whom Correspondence Should be Addressed: m.stumpf@imperial.ac.uk 



(N 
> 

00 
(N 

\o 
o 



X 



1 Introduction 

Mathematical models are widely used to describe and analyze complex systems and processes across the natural, 
engineering and social sciences. Formulating a model to describe, e.g. a predator-prey system, geophysical 
process, communication system, or social network requires us to condense our assumptions and knowledge into 
a single coherent framework (May 2004). In constructing these models we have to state our assumptions 



about their constituent parts and their interactions explicitly. Mathematical analysis or computer simulations 
of these models then allows us to compare model predictions with experimental observations in order to test 
and ultimately improve the models. Even previously largely observational sciences such as biology, geology and 
meteorology are now heavily influenced by computer simulations which are employed for explanatory as well as 
predictive purposes. 

Because many of the mathematical models in these disciplines arc too complicated to be analyzed in closed 
form, computer simulations have become the primary tool in the theoretical analysis of very large or complex 
models. Modelling such systems is often (relatively) straightforward if the mathematical structure of the model 
and reliable estimates for the model parameters are known. Unfortunately, it is considerably harder to infer 
the structure of mathematical models and estimate their respective parameters based on experimental data, in 
particular if we seek to infer a model that can describe the data. Whenever probabilistic models exist we can 
employ standard model selection approaches of either a frequentist, Bayesian, or information theoretic nature 



( Cox & Hinkley 1974 Mackay 2003| |Burnham fc Anderson 2002 ) . But if suitable probability models do not 



exist, or if the evaluation of the likelihood is computationally intractable, then we have to base our assessment 
on the level of agreement between simulated and observed data. This is particularly challenging when the 
parameters of simulation models are not known but must be inferred from observed data as well. 

For such cases — cases where conventional statistical approaches fail because of the enormous computational 
burden incurred in evaluating the likelihood — so-called approximate Bayesian computation (ABC) schemes 
have recently come to the fore ( Pritchard et al. 1999[ |Beaumont et al. 2002 Tanaka et al. 2006| Secrier et al. 
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2009 ) . These forgo the explicit evaluation of the likelihood by a principled comparison between the observed and 



simulated data. In many cases inferences are furthermore based not on the data themselves, but on summary 
statistics of the data. Such statistics serve as data compression tools ( Cover & Thomas 2006) and, if used 



sensibly, enable computationally efficient inference from data sets, where the complexity of the data would 



stymie conventional likelihood-based methods (Pritchard et al. 1999 Ratmann et al. 2007) 



ABC schemes have become increasingly popular, because of their flexibility and their deceptive conceptual 
simplicity. While especially some computationally demanding areas have fuelled the development of powerful 



ABC approaches, notably population genetics (Fagundes et al. 20071, evolutionary biology (Wilkinson et al. 



2010), systems biology (Liepe et al. 2010), dynamical systems theory (Toni et al. 2009), and epidemiology (Blum 



& Tran 2008), a worrying increase in naive (and plainly incorrect, see e.g. Walker et al. (2010)) applications are 
beginning to emerge. Such problems, as recent results by Didelot et al. (2010) and Robert et al. (2011) suggest, 
are more imminent in model selection applications. In the present context, all problems stem back to the issue 
of sufficiency of statistics, and its role in model selection. The present paper sets out to develop remedies for 
such problems. Below we will begin with an outline of the basic ideas underlying ABC, before discussing the 
particular challenges raised in particular by Robert and colleagues. We will then list the cases where ABC- 
based model selection is possible; in essence it is the ill-judged use of summary statistics and failure to ensure 
sufficiency which lies at the heart of the problem identified by Robert et al. (2011), before setting out methods 
that allow us to remedy these problems. We illustrate the use of these methods in a number of applications 
before concluding with some more general remarks on the conceptual and mathematical foundations of ABC 
approaches. 



2 Approximate Bayesian Computation 

2.1 Sufficient Statistics 

Bayesian inference centres around the posterior distribution, 

p(x) 

where x are the data, which are drawn from some sample space, x 6 £1 C R D , f(x\6) is the likelihood, ir(9) 



the prior distribution, and 9 an unknown parameter ( |Robert| 2007); p(x) is often called the evidence ( |Mackay 



2003), but in many applications or discussions dismissed as a normalization constant. The Likelihood principle 
states that all the information about parameter 9 is contained in the likelihood function f(x\9), i.e. once we 
have the form of the likelihood, we do not have to retain any of the data. This principle is complemented by 
the sufficiency principle. Here a summary statistic of the general form 

S:R d — > R w , S(x) = s (2) 

with w <C d typically, is called sufficient if the likelihood is independent of the parameter conditional on the 
value of the summary statistic. We denote by f{x\9, s) the likelihood conditionally on the value of the summary 
statistic S(x) = s and g(x\s) the density probability of the data given the summary statistic. The statistic is 
sufficient if and only if: 

f(x\s,9)=g(x\s). 

The likelihood can then generally be written in the Neyman-Fisher factorized form 

f(x\6) = g(x\s)f(s\9), (3) 



where s — S(x) and f(s\6) is the likelihood of the sufficient statistic (Cox 2006 1. The function g(-) is independent 
of the parameter 9. Thus f(s\9) carries all the information about the parameter. 

This factorization is, however, not unique, as it depends on the sufficient statistic, which is generally not 
unique. For example, any statistic containing additional information in addition to a sufficient statistic is also 
sufficient. Therefore we typically seek to determine the minimally sufficient statistic, which is generally unique. 
As a consequence the functional forms (and values) of g(-) and f(-\9) depend on the choice of sufficient statistics. 

In order to understand the terms in the factorization theorem, Eqn. {3}, we focus on the case where A is a 
discrete random variable, we then have 

g{x\s) = Pr(A = x\S(X) = s) 
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and 

f(s\9) = Pr(S(X) = s\9). 

Thus g(x\s) is really the conditional probability of X given an observed value for the summary statistic, S(X) = s 
and it is therefore linked to the compression of the data achieved by the summary statistic. Here it is worth 
remembering that the complete data also form a valid summary statistic, and for this choice of statistic we have 
trivially, for all s, <?(a;|s) = 1, Vx G Q. 



2.2 ABC for parameter inference 

In practical applications we are interested in evaluating the posterior distribution for model parameters, 9, 
defined Eqn. ([lj. When the likelihood is hard to evaluate it is still often possible to simulate from the model 
according to f{-\9). It is easy to show that for simulated data y we have 

^ ) = !H^M, (4 ) 

p[x) 

which in many practical applications can be approximated using suitable distance functions, A(x, y), whence, 
after marginalization over simulated data we get, 

^ I t{A{x,y) < e)f(y\9)Tr(9) 

This is obviously correct, as e — > 0. 

Based on the fact that sufficient statistics contain all the information about the 9 that is contained in the 
data, we may be tempted to replace the data by the corresponding summary statistics. We thus replace the 
comparison of the data in Eqn. ([5| by a comparison of the values of their respective summary statistics, using 
a distance function which, by abusing the notation, is denoted by A(S(x),S(yj), 

^ I t(A(S(x),S{y)) £ e)f(y\9)w(6) 
f MD t(A(S(x), S )<e)f( s \9M9)ds 



J e J Rw t(A(S(x),s) < e)f(s\9)n(9)dsd9 



where we have made it explicit in the second line that once we use summary statistics we are only considering 
the second term on the right-hand side of Eqn. ([3]); any dependence on g(-) is lost, and so is therefore the effect 
of the data-compression in the summary statistic. Something similar is also implicit in conventional Bayesian 



inference. Cox (2006) reinforces this point by stating that 11 Any Bayesian inference uses the data only via the 
minimal sufficient statistic. This is because the calculation of the posterior distribution involves multiplying the 
likelihood by the prior and normalizing. Any factor of the likelihood that is a function of y alone will disappear 
after normalization.'''' 

Quite generally, the choice of the summary statistic is important: without sufficiency the whole inference 
will only map the parameter regimes that will lead to model behaviour which embodies the constraints implied 
by specified summary statistic. Only if the summary statistic is sufficient, however, will we be able to infer 



the model parameters (Fearnhead & Prangle 2010a). For some non-sufficient statistics, however, some aspects 
of the true posterior can be elucidated as shown in Figure [T] We will return to a discussion of non-sufficient 
statistics later. 



2.3 ABC for model selection 

One of the perhaps most useful (and aesthetically pleasing) aspects of the Bayesian inferential frameworks is 



that model selection is natural and intrinsic, especially compared to frequentist frameworks (Robert 2007). 



From the earliest days ABC approaches for model selection have also been promoted, see e.g. |Beaumont et al.\ 
(2002 1 and Fagundes et al. (2007). Recently, however, Robert et al. (2011 ) have issued a note of caution. This is 
based on the observation that a statistic, or set of statistics, which is sufficient for model parameters in different 



models, may still not be sufficient across models (Didelot et al. 2010 Robert et al. 2011). 
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Figure 1: Parameter inference for the mean of a normal model with known standard deviation, a 2 = 1, 
using the mean, variance, maximum and minimum as a statistic. We find that only the truly sufficient 
statistic (9 = fi = 1) yields the correct posterior distribution despite the fact that we generated 1,000 
acceptances of samples of size 10,000 with e = 0.001. 
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To illustrate this point we now consider a finite set of models, Ai = {Mi, . . . , M q }, each of which has 
an associated parameter vector 9 m e m , 1 < m < q. We aim to perform inference on the joint space over 



models and parameters, (m, 9 m ). |Robert et al. (2011 ) have focused on the Bayes Factors, but, of course, similar 
problems arise also for the marginal model likelihoods, 



p(m\x) 



I& m f(x\9 m )ir(9 m )d9 m 1T(m) 



(7) 



Again, we can apply ABC by replacing evaluation of the likelihood in favour of comparing simulated and real 
data for different parameters drawn from the posterior, whence we obtain 



p(m\x) 



J em J Q t(A(x,y) < e)f{y\6M6 m )d9 m dy 7r(m) 
E'=i Se t In y) < e)fmM0i)Midy ' 



(8) 



which is of course always exact once e — > 0. The same is no longer true, however, once the complete data have 
been replaced by summary statistics. So in general 



p(m\x) 7^ 



Je m Jjj™ l(A(S m (x),s m < e)f(s m \9 m )Tr(9 m )d9 m ds m n(m) 
Ei=i U. /r» l(A(^(x), Sl ) < e)~f{s i \9 i )^{9 i )d9 l ds l ' 



(9) 



where Si, 1 < i < q are the summary statistics for each model. An equality can only hold if the factors 
9i(x), 1 < i < q are all identical. Otherwise the different levels of data-compression are lost and unbiased model 
selection is no longer possible. 



2.4 Resuscitating ABC Model Selection 



As shown by Robert et al. (2011) and Didelot et al. (2010) even if we choose a set of statistics that is sufficient 



for parameter estimation across models, this does not guarantee that the same set of statistics are sufficient for 



model selection. R Robert et al. (2011) argue that therefore model selection, though not parameter estimation 



is fraught with problems in an ABC framework. Here we will argue that this is not the case. While we do agree 
that sufficiency and problems when using inadequate (or insufficient) statistics for model selection, we maintain 
that 

• this mirrors problems that can also be observed in the parameter estimation context (see Figure [T]), 

• for many important, and arguably the most important applications of ABC, this problem can in principle 
be avoided by using the whole data rather than summary statistics, 

• in cases where summary statistics are required, we argue that we can construct approximately sufficient 
statistics in a disciplined manner, 

• when all else fails, a change in perspective, allows us to nevertheless make use of the flexibility of the 
ABC framework (see Discussion). 

The bulk of this article will deal with the derivation and application of a method that constructs summary 
statistics appropriate for the twin use of parameter estimation and model selection. After this we will briefly 
return to an outline of alternative approaches and map the applicability of ABC-based model selection more 
generally. 



3 Some Basic Concepts from Information Theory 



Our method is most easily framed in the terms of information theory ( |Mackay| |2003 Cover & Thomas 2006 



Mezard & Montanari 2009), and in order to keep this article self-contained we briefly review some of the basic 



concepts. We let X denote a discrete random variable with potential states X and probability mass function 
Px{%) = Pr (A = x), x £ X. In this section, for sake of clarity, we explicitly denote the random variables as 
subscript of the corresponding probability density. 
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3.1 Entropy and Mutual Information 

The entropy of X, denoted by H, measures the uncertainty of X and is defined as follows, 
H(X) = -J2px(x)log Px (x) = -E x [logpx(X)} = E x 

X 

where Ex denotes the expectation under the probability mass function px ■ Let (X, Y) be a pair of discrete 
random variables with joint distribution px.Y- The conditional entropy H(Y\X) is defined as 

H(Y\X) = -E x ,y [\ogp Ylx (Y\X)] . 

The mutual information I(X;Y) between two discrete random variables X and Y measures the amount of 
information that Y contains about X. It can be seen as the reduction of the uncertainty about X due to the 
knowledge of Y, 

I(X;Y)=H(X)-H(X\Y) = £ Px ,y(x, y) log l^lf- = KL(p x .y WpxPy) > 0, 

where KL(P\\Q) refers to the Kullback-Leibler (KL) divergence between probabilities P and Q. The mutual 
information I(X; Y) is equal to if and only if the random variables X and Y are independent. 
The conditional mutual information of discrete random variables X, Y and Z is defined as 

I(X; Y\Z) = H(X\Z) - H(X\Y, Z); 

it is the reduction in uncertainty of X due to knowledge of Y when Z is given. This quantity is zero if and only 
if X and Y arc conditionally independent given Z, which means that Z contains all the information about X in 
Y. The conditional mutual information satisfies the chain rule: for discrete random variables X\,X 2 , ■ ■ ■ ,X n 
and Y we have 

n 

I(Xi,X 2 , • • • , X n ;Y) — ^ I(Xi;Y\Xi, X 2 , • • • , Xi- 2 , -X't-i)- 

i=l 

In the following we only consider entropy rather than the differential entropy which applies to continuous 
random variables. 



\ogp x {X) 



>0, 



3.2 Data Processing Inequality and Sufficient Statistics 

The data processing inequality (DPE) states that for random variables X, Y, and Z such that X — > Y — >• Z, 
(i.e. Y depends, deterministically or randomly, on X and Z depends on Y) 

I(X-Y)>I(X-Z), 

with equality only if X — > Y — > Z forms a Markov Chain, which means that the random variables X and Z are 
conditionally independent given Y: Px.z\y = Px\yPz\y ■ 

Now consider a family of distributions {/(-|0)}eee an d let X be a sample from a distribution in this family. 
Let S be a deterministic statistic and denote by S the random variable such that S = S(X). Therefore 
6 -S- X S\ By the DPE 

WS)<wx). 

Analogously to the discussion above, a statistic 5 is said sufficient for parameter 9 if and only if S contains all 
the information in X about 9, i.e. 

1(9; S) = 1(9; X) where S = S(X) 

Equivalently we may write: 

Result 1. S is a sufficient statistic for parameter 9 if and only if 

I(9;X\S) = 0. 

In that case, 



Ee,x 



g p(9\S) 



0. 
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From now on, we resume the use of typically bayesian notation (see section 2) where the random variables 
are no longer given as subscripts but are unambiguously inferred from context. For example, the density 
of probability p(6\S) involved in the theorem designates both the posterior probability of the parameter 9 
conditional on S and its value when the parameter is actually equal to 9. 

Proof. By definition, S is a sufficient statistic if and only if 1(9; X) = 1(9; S). S being a deterministic fonction 
of X, the previous equation is equivalent to 

1(9; X, S)- 1(9; S) = 
«• H(9)-H(9\X,S)-H(9) + H(9\S)=0 
«• H(9\S) - H(9\X,S) =0 

I(9;X\S)=0 

If S is a sufficient statistic for parameter 9 then 



I(9;X)=I(9;S) 

Era \ i P(0> x ) i a m s ) 
P ^ a:) log ... , I = > s ) lo § ra\ r \ 

0X P(0)p(x) ^ P(0)p(s) 

Era \ i P(0> x ) ra m P^^W) 



^ ' °p(6)p(S(x)) 

□ 



4 Constructing Sufficient Statistics 

We consider the following situation: suppose that we have a finite set of summary statistics S = {Si, . . . ,S W } 
and assume that S is a sufficient statistic. We aim to identify a subset U of S which is sufficient for 9. The 
following result characterizes such a subset. 

Result 2. LetS be a finite set of summary statistics of X, and assume that S is a sufficient statistic. Denote 
by U a subset of S . For a random variable X distributed according to a distribution parametrized by 9, let 
U = U(X) and S = S(X). The following statements hold 

U is a sufficient statistic 
^ I(9;S\U) = 

E x [KL(p(9\S)\\p(9\U))}=0. 



Proof. By definition of the conditional mutual information 

1(9; S\U) = H(9\U) - H(0\S, U) = H(9) - H(9\S, U) - H(9) + H(9\U) 
= 1(9; U, S) - 1(9; U) = 1(9; S) - 1(9; U) 

since U is a vector composed of elements of S. The statistic S being sufficient, 1(9; S) = 1(9; X), and therefore 
1(9; S\U) — if and only if U is a sufficient statistic. Denote by h the function such that for all x, u = U(x) = 
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h(S(x)), 

1(6; S\U) = H{0\U) - H{6\S, U) = ^ p(0, s, u) log 



P(0\u, s) 
p(6\u) 



2^> a ) log = ^ p(6», a) log 



V p(*IM*)) / V wIM")) 

X>(s)X>(0|s)log (jp||y) =X; P (*)ifi(p(«|*)lb(«|M*))) 
^p(a ; )ifL(p(e|5(x))||^|W( i r))) = ^[^(K^|5')lb(^l^))] • 



□ 



According to this result identifying a sufficient statistic with minimum cardinality from a sufficient family 
S of summary statistics boils down to identifying the smallest subset U of S such that 1(9; S\U) = where 
U = U(X) and S — S(X) or, equivalently, Ex [KL(p(6\S)\ \p(9\U))] = 0. Here we aim to determine a sufficient 
statistic for Approximate Bayesian Computation (ABC) methods for parameter inference and then for model 
selection. We focus in this section on the parameter inference task. In the ABC framework, the expectation over 



the data of the Kullback-Leibler divergence (Cover & Thomas 2006) between the two posterior distributions 



p(9\S) and p(9\U) cannot be exactly computed since we only have at our disposal a dataset x and the value 
of the statistics for this dataset s* = (s*, . . . , s^) = S(x). Thus, we approximate it by the expectation with 
respect to the empirical measure of the data. The method is summarized in Algorithm [l] We denote by \U\ the 
cardinality of a set U. 



Algorithm 1 Minimization of the mutual information 
1: input: a sufficient set of statistics whose values on the dataset is s* = {s*, . . . , s^} 
2: output: a subset U* of s* 
3: for all u* C s* do 
4: perform ABC to obtain p(9\u*) 
5: end for 

6: let T* = {u* C s* such that KL (p(9\s*)\\p(6\u*)) = 0} 
7: return U* = argmin u „ gT , \u*\ 



This methodology is computationally prohibitive since it enumerates all possible subsets u* of s* and perform 
the ABC algorithm for all of these. Moreover, it is challenging to obtain a precise estimate of the posterior 
p(9\u*) of 9 given a value of the statistics U = u* when the cardinality of u* is large. In particular, it is often 
impossible to obtain an estimate of p(9\s*). It is then necessary to design an algorithm which does not need 
the computation of this probability. The following result provides a first step into this direction. 

Result 3. Let X be a random variable generated according to f(-\9). Let S be a sufficient statistic and IA and 
T two subsets of S such that U = IA(X), T — T(X) and S = S(X) satisfy U cTcS. We have 

I(d;S\T)=I(6;S\U)-I(9;T\U) . 

Proof. For all subset T of S, 

1(9; S\T) = H(9\T) - H(9\S, T) = H(9\T) - H(9\S) . 
Then, for U and T subsets of S, 

1(0; S\T) - 1(9; S\U) = H(9\T) - H(9\U) . 
If U in included in T, then H(9\T) = H(6\T, U) and 

1(0; S\T) - 1(9; S\U) = -1(6; T\U) 
which proves the result. □ 
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It follows that the information contained in S on 9 given T is smaller than the information contained in 
S on 9 given U C T. In order to construct a subset T of S such that 1(9; S\T) = 0, where S — S(X) and 
T = T(X), it is thus sufficient to add one by one statistics of S until the condition holds. Indeed, if we denote 
the statistic added at time step k < w by St^) and S/f.) = S(k)(X), then 

1(9; -S'|S'(i) , ■ • ■ , S'(fc)) < 1(0; S\S(i),. ■ . , S(k+i)) ■ 

And there exists an integer k < w such that 1(9; S\S^, . . . , S^y) = 0. According to result|3j at each time fc, 

J(0; S\S(i),. ■ . , S'(fc)) = 1(9; S\S(t),. ■ . , S^-i)) - 1(0; 5(1), • . . , S^)\S^, . . . , S^-i)) ■ 

The mutual information is a non negative function, then, in order to decrease as much as possible 1(9; S\S^, . . . , S 1 ^)), 
the added statistic at time k > 1 should be such that 

S(k) = ar S max Ves\{Sf (1) ,...,s (k _i ) } I (^> %)) ■ ■ • ' V]^!), . . . , 

= argmax yeSUS(i) _ -Ex [i£X (K^i), • • • > ^)lb^l'S'(i), ■ ■ ■ , <%-i)))] ■ 

As previously mentioned, in practice, this expectation can not be computed. We hence replace it by the 
expectation with respect to the empirical measure of the data, leading to the approximation, for all 1 < k < n, 

argmax VeSUS(i)i .. >S(fe _ 1)} £j, ( x) [KL (p(9\S {1) , . . . , ^(jb-i), ^)! |p(^l>^(i), • • • , s (k-i)))] 

~ ar g max o*es*\{^ 1) ,...,^»_ 1) } ^ L (p( \ s *x), • • • ' s (k-i)> u *)lb( l»(i)j • • • i s (fc-i))) < 

where s* is the value of the statistic S on the dataset. In addition, the first statistic £>(i) should contain the 
maximum information about 9, in the sense that 

S(i) = argmax VeS 1(9; V) = argmin VeS H(9\V) = argmax VeS E p{ey) [logp(9\V)} . 

This results in algorithm [2] 

Algorithm 2 Greedy minimization of the mutual information 
1: input: a sufficient set of deterministic statistics whose values on the dataset is s* = {s\, . . . , s^} 
2: output: a subset U* of s* 
3: for all u* £ s* , perform ABC to obtain p(9\u*) 
4: let sis = argmax u * gs » log p(9\u*) 
5: for k 6 {2, . . . , w} do 

6: for all u* G s*\ {s^p • • • , s (fc_i)}j perform ABC to obtain p(9\s* l y . . . ,s* k ^iyU*) 
7: let 

s (k) = argmax„* KL(p(9\s* 1) , . . . , s* k _ 1} ,u*)\\p(9\s* {1) , s* fc _ x) )) (10) 

8: if ^L(^| S * 1)5 . . . , s* {k) )\\p(6\ S * {ir . . . , ^ fe _ 1} )) < e then 

9: return L7* = (S( lV . . . 

10: end if 
11: end for 
12: return U* = s* 



This algorithm is computationally expensive if the cardinality of the set of statistics |<S| = w is large. Indeed 
at each iteration k, one performs the ABC algorithm w — k + 1 times. A simplification of it consists in replacing 
the maximization step (see equation (10)) by testing randomly various statistics and choosing a statistic Sr k ) 
such that 1(9; S^i), • • ■ , S7fc)|iS(i), . . . , S(k-\\) is large. Different criteria may be used to determine if the mutual 
information is large or not, and then decide if the statistic should be included or not. Most of these criteria consist 
of determining if the posterior probability of 9 given Sn), . . . , ^(fe-i) and the posterior probability of 9 given 
<S7i), . . . , S(ty are significantly different. If so, adding the statistic S( k ) is justified, otherwise we do not add it and 

instead turn to a different statistic. In the algorithm 3 1 we denote by C 
the criterion which is equal to 1 if the statistic S^k) snould be added and otherwise. In practice, we can use, 
for instance, one of the following measures: 



Pi 



'(i) 



'(AO 



),p(0\s 



(i)' 



'(fe-i) 
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• we add the statistic S^k) if KL(p(8\s*^ 1 - ) , . . . , s^OHp^la^j, ■ • • , s *(k-i))) — ^ k where 5k is a threshold which 
may be computed by bootstrapping the data to estimate E p ^ X ) \p{Q\S(i), • • • ^Vfe-i))] '■ 

5 h = KL (p(6\s* {1) ,. . . , s* (k) )\\E pix) [p(e\S (1)) . . . , %-i))]) , 

• the Kolmogorov-Smirnov test enables us to compare p(#|slj, . . . , s? fe _-^) and p(0|s?U, . . . , and so, 
the statistic 5(m is added if the test has a p- value smaller than a certain threshold, say 0.01. 

Algorithm 3 Stochastic minimization of the mutual information 
1: input: a sufficient set of deterministic statistics whose values on the dataset is s* = {s\, . . . , s^} 
2: output: a subset V* of s* 
3: choose randomly u* in s* 

4: T* <- 

5: V* <- n* 

6: repeat 
7: repeat 

8: if T* = then return 

9: end if 

10: choose randomly u* in T* 

11: T* <- T*\n* 

12: perform ABC to obtain p(8\V*,u*) 

13: until C [p(6\V*,u*),p(e\V*)] = 1 
14: T* ^— s*\ {V*,u*} 

15: optionally: V* OrderDependency (V* , u*) and T* ^— s*\F* 
16: 1/* ^ U n* 

17: until T* = 

18: return 



The order in which the statistics are added matters (Joyce & Marjoram 2008 Nunes & Balding 20101 and 
the only way to avoid this inconvenience is to use the computationally expensive algorithm |2| Nevertheless, we 
suggest to test for order dependency before deciding on whether to add a statistic. This consists of determining 
if, given the recently added statistic u* , the previously added statistics in V* still bring relevant information or 
are not necessary anymore and hence may be released from the set under construction. More precisely, after 
line 15 of the algorithm |3j we add the function described in algorithm |4| 

Algorithm 4 Order Dependency 



Input: A set of accepted statistics V* = ■ ■ ■ > s (&_],)} an d the last accepted statistic u* 
Output: A subset U* of {V* U u*} 

U* <- u* 

for i G {1, . . . , k — 1} do 

if C(p(9\U*,s* (i) ),p(e\U*)) = lthen 
U* <- U* U s* 

w 

end if 
end for 
return U* 



5 Relation to previous work 

The presented information theoretic framework builds upon two previous methods for summary statistic selec- 
tion. Our main contributions are the generalisation of the notion of approximate sufficiency, rigorous derivations 
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of algorithms using information theory and the application of summary statistic selection for the joint space. 



Joyce & Marjoram ( 2008 1 developed a notion of approximate sufficiency for parameter inference and pre- 



sented a sequential algorithm to score statistics according to whether their inclusion would improve the inference. 
Their sequential algorithm resembles (and indeed inspired) Algorithm 3 although they do not retain statistics 
once they have failed to be added which we feel is required since whether a statistic is added depends strongly 
on the statistics already accepted. Their rule for adding statistics is essentially an approximate test for inde- 
pendence on the posterior distribution under the addition of a new statistic but can only be used for single 
parameter models. We have shown that the true stopping criterion should be the change in KL divergence 
which can be used for multivariate posteriors, although tests for independence (KS, \ 2 ) can be used as an 
approximation in single parameter models. 



Nunes & Balding (2010) proposed a heuristic algorithm to minimize the entropy of the posterior with 



respect to sets of summary statistics . Additionally they proposed a refinement step where the set of statistics 
that minimised the posterior mean root sum of squared errors (MRSSE) was selected. The minimum entropy 
approach is related to Algorithm 1 since, when H(9) is constant, minimising the entropy maximises the mutual 
information. However, assuming there exists a sufficient statistic, choosing the set of statistics that minimises 
the entropy is guaranteed to give sufficiency but not minimal sufficiency since adding a statistic to a sufficient 
set can reduce the entropy by chance (a manifestation of "conditioning always reduces entropy"). 



6 Automated selection of summary statistics for model selection 

As pointed out recently sufficiency across models is still not sufficient to perform reliably model choice in the 
ABC framework. Here we show how a natural extension of the methodology introduced above can also be 
employed in order to construct sets of statistics that arc sufficient for ABC-based model selection, when it is 



impractical to use the raw data (Toni & Stumpf 2010) 



Consider q models, each with an associated set of parameters Qi,i £ {1, ...,(?}. We aim to identify a set of 
sufficient statistic for model selection. Let M being a random variable taking value in {1, . . . , q}. A statistic is 
sufficient for model selection if and only if it is sufficient for the joint space {M, {9i}\<i< q }. According to result[Tj 
this means that a summary statistic S is sufficient for model selection if and only if I(M, 9\, 6 q ;X\S) = 
where S = S(X) and X is a sample from a distribution in the family {/(•|^j)}e i ee i , i<i<q- The following result 
enables us to link this condition with the sufficiency for parameter inference for each model. 

Result 4. For all deterministic statistic S, 

I(M, 9 1} 9 q ; X\S) = I(M; X\6 U 9 q , S) + £ W;X\S) , 

i 

where S = S(X) and X is a sample from a distribution in the family {/(•|0i)}e i e0j, Ki<q- 
Proof. From the chain rule of mutual information we have that 

I(M,9 1 ,....,9 q ;X\S)=I(M;X\9 u ....,9 q ,S) + ^I(6 i ;X\e 1 ,--- A-i,S) ■ 

i 

The result then follows from the fact that, for all i, 1(9,; X\6i, ■■■ , 0i-i,S) = I{9 l ;X\S). □ 

The mutual information being always non negative, this shows that a statistic S is sufficient for model 
selection if and only if I(M; X\9\, 9 q , S) — and 1(9^ X\S) — for all i G {1, • • • , q}. Therefore, a sufficient 
statistic for model selection is sufficient for parameter inference in each model and, given the parameter values 
for every models, the statistic is sufficient for inferring the model. Thus, in order to identify a sufficient statistic 
for model selection, one should determine a set of minimal sufficient statistics S" H for each model 1 < i < q 
and then identify among all statistics T containing Ui S {i... g }6>" li the one with the smallest cardinality such that 
I(M; q , T) = 0. The method, summarized in algorithm [5j consists in running one of the previous 

algorithms, for example algorithm [3] and then add statistics which bring new information about the models in 
the sense that the posterior probability of the model conditional on the statistics varies significantly if we add 
a this new statistic. 

Similarly to algorithm [3j it is possible to test for order dependency before deciding to add a statistic. To 
do so, we apply the algorithm [4] in which at step 3, the set U* is initialized by M* U u* such that we always 



keep the set M * and condition line 5 is replaced by C 



p{m\6 u ...,e q , u*, sy^iMiet, ...,e q ,u*) 



i. 



n 



Algorithm 5 Stochastic minimization of the mutual information for model selection 
1: input: a sufficient set of deterministic statistics whose values on the dataset is s* = {s\, . . . , sjj,} 
2: output: a subset V* of s* which is sufficient for model selection 
3: for i € {1, . . . , q} do 

4: determine a sufficient statistic whose value on the dataset is S mi C s* using algorithm [3] 
5: end for 

6: Let M* = Ui<i< g 5 m » 
7: Let W* <- s*\M* 
8: choose randomly u* in W* 
9: T* <r- W*\{u*} 

io: y* U* 

11: repeat 
12: repeat 

13: if T* = then return V* 

14: end if 

15: choose randomly u* in T* 

16: T* <- T*\u* 

17: perform ABC to obtain p(M|6>i, . . . , 6 q , M*,V*,u*) 

18: until C[p(M\e 1 ,...,9 q ,M*,V*,u*),p(M\9 1 ,...,9 q ,M*,V*)] = l 

19: T* ^— PP*\ {V*,u*} 

20: optionally: V* <- OrderDependenceModelSelec (V*,M* U u*) and T* <- W*\V* 
2i: y* <- V* U U* 

22: until T* = 

23: return 



7 Applications 

We illustrate the framework developed above in three different contexts. First we consider a simple model 
selection problem involving two normal distributions. We then consider a typical population genetics example 
on three demographic scenarios for simulated data, before finally turning to a problem where we consider 
alternative random walk models; this last example should be typical for applications where likelihood-based 
inferences are out of the question due to the complexity of the models and the data. 

7.1 Normal example 

The developed framework was illustrated on a simple example with two models 

2/1, —Vd ~J\f{n,a-f) and y lt ...y d ~ J\f(fi, a\), 

where the variances of the normal distributions, <j\ and 02, are fixed. Under a conjugate prior, /j ~ 7V"(0,a 2 ), 
the true Bayes factor is given by 

ar d cxp{-5 2 /2a2}exp{-y 2 /2(a 2 + <r{ / d)} J a - 2 + da~ 2 

BF(y) = V , 

a 2 - d cxp{-5 2 /2a 2 }cxp{-y 2 /2(a 2 + al/d)}^- 2 + rfaf 2 

where 

d d 
y = d~ 1 ^2y i and S 2 = ^(Vi - y) 2 ■ 

»=1 i=l 

In this case y is sufficient for parameter inference but the pair {y, S 2 } is sufficient for the joint space. 

To test the automated choice of approximate summary statistics, 100 data sets were sampled under model 
1 and the algorithm run to select statistics for parameter inference and for the joint space from a pool of 5 
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statistics including y, S" 2 , range, maximum and a non informative statistic u ~ U(0,2). The values of the 
parameters were chosen to be o\ — 0.3, oi — 0.6, n = 15 and a = 2. Since there are difficulties that arise from 
the possibly very different scales of the summary statistics the distance was defined as 

A(S(x),S(y)) = ^ogS t (x) -log£ 2 (y)] 2 , 

i 

which accounts for the relative difference between data and simulation and avoids the need to known the scales of 
the statistics a priori. Here, Si(x) denotes the i-th component of S (x) £ R^The stopping criteria were defined 
through tests for independence (p < 1 x 10~ 5 ) between the posterior distributions under different summary 
statistics; a Kolmogorov-Smirnov test in the case of the continuous parameter posterior and a Pearson test in 
the case of the discrete model posterior. The ABC was run with 500 particles with e = O.f . 

Figure [2] shows the results of the summary statistic selection over the 100 runs. Figure [2] (left) shows the 
results for parameter inference across the two models. The mean was selected in every replicate, the maximum 
in 14 replicates and S 2 once. Figure [2] (right) shows the additional statistics selected for the joint space. S 2 was 
selected in 84 cases, the range in 19 cases, the maximum in 9 cases and the noise statistic once. Figure [3] shows 
the Bayes Factor obtained via ABC versus the analytical prediction. As expected the Bayes factor calculated 
using only the statistics selected for parameter inference is uncorrelated with the true Bayes factor (figure on 
the left). When the statistics sufficient for the joint space are included the ABC Bayes factor correlates well 
with the analytical prediction. 



7.2 Population genetics example 

To further demonstrate the efficacy of our methodology we applied the summary statistic selection procedure 
to a real world ABC problem, that of model selection in population genetics. Data were generated using 



coalescent simulations (Hudson 1991 ) from three competing models, producing 100 data sets from each for fixed 



parameters using a modified version of the MS software package downloadable from http://home.uchicago 



|edu/rhudsonl/sourc e/mksamples .html |. 
The models considered were: 

Model 1 Constant population size (with population mutation rate 9 = 20, corresponding to a 20,000bp stretch 
of DNA in a population of size N = 1, 000, 000. 

Model 2 Exponential growth model with exponential growth rate 7 = 0.4 and all other parameters as above. 

Model 3 Two island model with scaled migration rate m = 10 and all other parameters as above. 

To perform ABC we generated 5, 000, 000 samples of datasets comprising 100 chromosomes for each model; 
in each case the population mutation rate 9 (Ewens 2004) was drawn from the prior U(5, 30); the real value for 



which the data were generated was 9 — 20 (measured in units of total population size). The summary statistics 
summarised below were then used in our ABC summary statistic selection framework to derive sufficient sets 
of summary statistics for model selection on the observed data. The summary statistics calculated were: 

51 Number of Segregating Sites, N$. 

52 Number of Distinct Haplotypes, Nh- 

53 Homozygosity, h^, where h is the probability that two haplotypes are identical, 

n h 

h H = Y^ v l- 

h=l 

54 Average SNP Homozygosity, 

N s 
i=l 

55 Number of occurences of most common haplotype, 

56 Mean number of pair-wise differences between haplotypes, T. 
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1 00 




mean S2 range max random mean S2 range max random 



Figure 2: Summary statistics selected in 100 runs of the automated summary statistic selection. Left: 
Statistics selected for parameter inference (the union of statistics found under model 1 and model 2). 
Right: Additional statistics chosen for the joint space. 




log(BF) predicted log(BF) predicted 

Figure 3: Predicted vs approximated log Bayes Factor for the normal toy model. Left: The case for 
sufficient statistics selected for parameter inference. Right: The case for sufficient statistics selected 
for the joint space. In both cases the red line represents the line y = x. 
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S7 Number of Singleton Haplotypes, f s jj 



58 Number of Singleton SNPs, f aS . 

59 Linkage Disequilibrium measured by 

N S (N S - 1) Mi)M*>o(j)Mj) 



510 Fraction of pairs of loci which violate the four-gamete test, i.e. for which the two-locus haplotypes 00, 01, 

10 and 11 exist. 

511 Random variable, p ~ U\q i\- 

where N is the number of sequences in the data, and for each SNP locus, i, let vq(i) denote the frequencies 
of the ancestral and derived alleles; further for any haplotype h, is the corresponding frequency. 

The results of the selection process when performed over 100 different simulated data sets from each of 
the three models considered are shown in figure [4j It is apparent that the chosen statistics vary between data 
sets generated quite considerably — this is to be expected as the statistics required for sufficiency will vary 
depending on the data. For data generated by all of the models it is apparent that S4, the average SNP 
homozygosity, is selected often, whilst as expected the uninformative random statistic Sll is rarely chosen. For 
data generated under the exponential growth model, S4, as well as S3 (homozygosity), S6 (mean number of 
pair- wise differences between haplotypes) and S9 (linkage disequilibrium), appear to be favoured by the model 
selection approach. Data generated from this model apparently requires more statistics than data generated 
from the Null model to achieve sufficiency. The method applied to data generated from the two island model 
selects statistics S4 and S9 often, interestingly seeming to require fewer statistics than the exponential growth 
model to perform model selection. 

This is a new and initially perhaps surpising finding: the summaries chosen by our model selection approach 
depend subtly on the true data-generating model. This is, by hindsight, however, not unexpected: we are trying 
to achieve sufficiency for model parameters first, and then pool the statistics required to do just that for all 
models, before refining this set of statistics in order to obtain sufficiency for model selection. As some models 
will generate data that is more difficult to obtain under other models than is the case vice versa, such relative 
biases will affect the set of statistics chosen. In light of population genetics theory, therefore, our observations 



are completely in line with our understanding of coalescent processes (see e.g. Hein et al. (2005)). 



7.3 Random walk models 

We also apply our framework to the problem of model selection on random walks, using a number of summary 
statistics. The models (Rudnick & Gaspari |2010[ ) under consideration were: 

Model 1 Brownian motion. 

Model 2 Persistent random walk (where the walk is more likely to continue in the same direction over successive 
steps but does not have a particular favoured orientation). 

Model 3 Biased random walk (where one direction is favoured). 

We used five summary statistics, that are individually not sufficient in more than one dimension for any of 
the random walk models: 



51 Mean square displacement. 

52 Mean x and y displacement. 

53 Mean square x and y displacement. 

54 Straightness index. 

55 Eigenvalues of gyration tensor (reference random walks book). 
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S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 



Figure 4: Summary statistics selected in 100 runs of the automated summary statistic selection 
procedure on simulated data sets from our three population genetics models, (each run is performed 
on a different simulated observed data point). A) Statistics chosen for model selection with data 
generated from model 1. B) Statistics chosen for model selection with data generated from model 2. 
C) Statistics chosen for model selection with data generated from model 3. 
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Since the models have multiple parameters, we can no longer apply a x 2 test to select sufficient statistics, 



and so instead we approximate the KL-divergence using the posterior, applying the formula described in (Boltz 



et al. 2007 ), 

Nv 

KL(p X \\p Y ) » log jj-Z-j + dEu[logp k (;V)} - dEullogpki-, U)], (11) 

where U and V are the sets of posterior particles drawn from distributions px and py respectively, d is the 
number of parameters and Ej/[log/5fc(-, V)] is the expectation of the distance to the fcth nearest neighbour in 
the set of particles V, Pk(u, V) of each particle u £ U. 

Applying formula (11) in our summary statistic selection framework to data simulated from the three 
different models over 100 runs, the statistics shown in figure [5] are chosen. Again it is apparent that there are 
some differences in the selected statistics for different data sets generated. Looking at the summary statistics 
selected by our method, statistic S5, the eigenvalues of the gyration tensor, a measure of the anisotropy of the 
random walk, appears to be chosen often for data generated by all three models. There also appears to be 
a slight preference for statistic S2, the mean x and y displacement, which can be understood given that this 
statistic is necessary for sufficiency for parameter inference on the biased random walk model. We need to stress 
that the structure of the data here is complex and summary statistics are expected to be hugely variable. 



8 Discussion 



Sufficient statistics are rare; in convenient form — i.e. where the number of statistics is equal to the number of 
parameters to be estimated — they are restricted to problems that can be described in terms of models that 



belong to the exponential family (Lehmann & Casella 1993 Didelot et al. 2010). As previous authors have 



pointed out it is necessary to develop methods that construct sets of statistics that are (at least approximately 
(|Le Cam] |1964j |Kusama[ |1976[ )) sufficient ( |Joyce fc Marjoram) |2008| |Nunes fc Balding) |2010| |Fearnhead ~fc 
Prangle| " 2010a|b[ ). It is either this, or reinterpreting ABC-based inferences not as approximations to the full 



Bayesian (and thus likelihood-based) apparatus but as inference procedures in their own right (Wilkinson 



2008 Drovandi et al. 20111, potentially systematically biased or for approximate models. A third approach, 



previously advocated, is to consider model checking rather than model selection as a viable way of ensuring that 
only appropriate models are calibrated against data. We believe that the latter position fails to acknowledge 
the role of sufficiency of statistics also in the context of parameter estimation; and we will briefly return to this 
point below after having addressed the other two points. 

All methods aimed at constructing collections of statistics that taken together are (approximately) sufficient 
will fail, almost trivially, unless an exhaustive set of summary statistics can be envisioned which fulfils the 
sufficiency criteria as outlined above. If that is not the case, then we might naively expect that all candidate 
summary statistics from our starting set S will be included in IA. This, however, need not (and we believe 
generally will not) be the case, as the information theoretical framework will tend to bias against inclusion of 
statistics that are in some way co- linear to any statistics that are already included in the constructed set. It 
is, of course, in principle possible to use the KL divergence with respect to the distribution obtained with the 
full data as an overall benchmark, but in cases where this is indeed possible, it may be best to use the full data 



(see e.g. Toni et al. (2009)) for inference rather than risk the information reduction inherent to most summary 
statistics. 

There has been much interest in trying to interpret ABC not solely as an approximation to the "true" 
posterior, but as an inferential framework in its own right (Wilkinson 2008 Drovandi et al. 20111. This is 



perhaps an attractive option. One way of achieving this shift in perspective is to consider distributions such as 

P (M,e\s) 

as distributions which specify the probability of a parameter and model being in concordance with a given 
summary statistic. If all we care about is that a model and parameter combination have high (or low) probability 
of producing data with certain mean/maximum/minimum or any other summary statistic value, then this is 
perfect. It is easy to envisage scenarios where we are only interested in certain aspects of the data (such as 
maximum water levels). ABC methods can be used to infer parameters (and models) that are more likely to 
give rise to simulated data that shares some but not all characteristics of the data. Interestingly, this would 



also allow us to employ ABC as a design tool (Barnes et al. 2011): we specify the data (or system behaviour) 
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S1 S2 S3 S4 S5 



Figure 5: Summary statistics selected in 100 runs of the automated summary statistic selection on 
different simulated data sets, considering three different random walk models. A) Statistics chosen for 
model selection with data generated from model 1. B) Statistics chosen for model selection with data 
generated from model 2. C) Statistics chosen for model selection with data generated from model 3. 
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that we would like to observe and infer parameters (and models) which have high probability of producing these 
types of behaviour. 

While this may perhaps seem like sophistry it does also have serious implications for model checking: any 
ABC approach that is based on summary statistics will infer model parameters (or marginal model posteriors) 
that reflect the behaviour encoded by these statistics. Thus we can no longer use these same statistics for model 
checking. This reflects the need to use non-sufficient summary statistics for model checking from Bayesian 
posterior predictive distributions: if we perform inference under a model for which a sufficient statistic exists, 
then calculating the same statistic for replicate data generated from the posterior predictive distribution will 
result in test statistics that are in line with the observed data, irrespective of the validity of the model. Hence 
some authors, in particular Gelman et al. (2003), strongly advocate the use of graphical model checking tech- 
niques over numerical tests. We feel that the situation in ABC reflects some of the same problems that are also 
encountered in model checking. Thus in an ABC framework, irrespective of whether the statistics are sufficient 
or not, the posterior distributions reflect the choice of statistics and the same statistics are therefore ill-suited 
for model checking. 

We conclude by reiterating that ABC approaches employing summary statistics rather than the whole 
data have to fully engage with the level of information-loss inherent to summary statistics. Notions of simple 
sufficient statistics probably do not apply for most scientifically interesting and challenging problems and the use 
of statistics rather than the real data will always result in loss of information. Our approach is based around the 
assessment of information loss and allows the principled construction of sets of statistics (from a candidate set) 
that capture as much as possible from the observed data. While not a panacea, it is within the computational 
reach of ABC practitioners and makes information loss due to inadequate use of statistics apparent, for both 
the parameter and model selection problems. 
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